Computationally efficient audio coder

ABSTRACT

The present invention provides a computationally efficient technique for compression encoding of an audio signal, and further provides a technique to enhance the sound quality of the encoded audio signal. This is accomplished by including more accurate attack detection and a computationally efficient quantization technique. The improved audio coder converts the input audio signal to a digital audio signal. The audio coder then divides the digital audio signal into larger frames having a long-block frame length and partitions each of the frames into multiple short-blocks. The audio coder then computes short-block audio signal characteristics for each of the partitioned short-blocks based on changes in the input audio signal. The audio coder further compares the computed short-block characteristics to a set of threshold values to detect presence of an attack in each of the short-blocks and changes the long-block frame length of one or more short-blocks upon detecting the attack in the respective one or more short-blocks.

RELATED APPLICATIONS

This application is a Divisional of U.S. application Ser. No. 10/466,027, filed on May 20, 2004, which claims the priority benefit and is a National Stage Application under 371 of PCT Application Serial No. PCT/IB01/01371, published on Jul. 18, 2001 as WO 02/056297 A1, which applications and publication are incorporated herein by reference in their entirety.

FIELD OF THE INVENTION

This invention relates generally to processing of information signals and more particularly pertains to techniques for encoding audio signals inclusive of voice and music using a perceptual audio coder.

BACKGROUND

A perceptual audio coder is an apparatus that takes a series of audio samples as input and compresses them to save disk space or bandwidth. The perceptual audio coder uses properties of the human ear to achieve the compression of the audio signals.

The technique of compressing audio signals involves recording an audio signal through a microphone and then converting the recorded analog audio signal to a digital audio signal using an A/D converter. The analog signal is sampled at a specific rate (called the sampling frequency), and this results in a series of audio samples. The digital audio signal is simply a series of numbers. The audio coder divides the digital audio signal into large frames of fixed length. Generally, the fixed length of each large frame is around 1024 samples. Typically a frame of samples is a series of numbers. The audio coder can only process one frame at a time. This means that the audio coder can process only 1024 samples at a time. Then the audio coder transforms the received fixed-length frames (1024 samples) into a corresponding frequency domain. The transformation to a frequency domain is accomplished by using an algorithm, and the output of this algorithm is another set of 1024 samples representing a spectrum of the input. In the spectrum of samples, each sample corresponds to a frequency. Then the audio coder computes masking thresholds from the spectrum of samples. Masking thresholds are another set of numbers, which are useful in compressing the audio signal. The following illustrates the computing of masking thresholds.

The audio coder computes an energy spectrum by squaring the spectrum of the 1024 samples. Then the samples are further divided into a series of bands. For example, the first 10 samples can be one band and the next 10 samples can be another subsequent band and so on. Note that the number of samples (width) in each band varies. The width of the bands is designed to best suit the properties of the human ear for listening to frequencies of sound. Then the computed energy spectrum values within each of the bands are summed separately to produce a grouped energy spectrum.
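The grouping step described above can be sketched as follows. This is a minimal illustration only: the function name `grouped_energy`, the `band_edges` layout, and the sample values are assumptions made for the example and are not taken from this specification or from any coding standard.

```python
def grouped_energy(spectrum, band_edges):
    """Square each spectral value, then sum the squares inside each band.

    band_edges is a hypothetical list of boundaries: band b covers
    spectrum[band_edges[b]:band_edges[b + 1]], so bands may differ in width.
    """
    energy = [x * x for x in spectrum]  # energy spectrum
    return [sum(energy[lo:hi]) for lo, hi in zip(band_edges, band_edges[1:])]


# Toy spectrum of six values split into three bands of widths 2, 1 and 3.
spectrum = [1.0, -2.0, 3.0, 0.5, 0.5, 2.0]
band_edges = [0, 2, 3, 6]
grouped = grouped_energy(spectrum, band_edges)
```

A real coder would use band boundaries chosen to follow the critical bands of the ear; the uneven widths here only mirror the note that the number of samples per band varies.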

The audio coder applies a spreading function to the grouped energy spectrum to obtain an excitation pattern. This operation involves simulating and applying the effects of sounds in one critical band to a subsequent (neighboring) critical band. Generally this step involves convolution with a spreading function, which results in another set of fixed numbers.

Then, based on the tonal or noise-like nature of the spectrum in each critical band, a certain amount of frequency-dependent attenuation is applied to obtain initial masking threshold values. Then, by using an absolute threshold of hearing, the final masked thresholds are obtained. Absolute threshold of hearing is a set of amplitude values below which the human ear will not be able to hear.

Then the audio coder combines the initial masking threshold values with the absolute threshold values to obtain the final masked threshold values. Masked threshold value means a sound value below which a sound is not audible to the human ear (i.e., an estimate of maximum allowable noise that can be introduced during quantization).

Using the masked threshold values, the audio coder computes perceptual entropy (PE) of a current frame. The perceptual entropy is a measure of the minimum number of bits required to code a current frame of audio samples. In other words, the PE indicates how much the current frame of audio samples can be compressed. Various types of algorithms are currently used to compute the PE.

The audio coder receives the grouped energy spectrum, the computed masking threshold values, and the PE and quantizes (compresses) the audio signals. The audio coder has only a restricted number of bits allocated for each frame depending on a bit rate. It distributes these bits across the spectrum based on the masking threshold values. If the masking threshold value is high, then the audio signal is not important and is hence represented using a smaller number of bits. Similarly, if the masking threshold is low, the audio signal is important and is hence represented using a higher number of bits. Also, the audio coder checks to ensure that the allocated number of bits for the audio signals is not exceeded. The audio coder generally applies a two-loop strategy to allocate and monitor the number of bits to the spectrum. The loops are generally nested and are called Rate Control and Distortion Control Loops. The Rate Control Loop controls the distribution of the bits so as not to exceed the allocated number of bits, and the Distortion Control Loop does the distribution of the bits to the received spectrum. Quantization is a major part of the perceptual audio coder. The performance of the audio coder can be significantly improved by reducing the number of calculations performed in the control loops. The current quantization algorithms are very computation intensive and hence result in a slower operation.

Earlier we have seen that the audio coder receives one frame of samples (1024 samples in length) as input and converts the frame of samples into a spectrum and then quantizes using masking thresholds. Sometimes the input audio signal may vary quickly (when the properties of a signal change abruptly). For example, if there is a sudden heavy beat in the audio signal and the audio coder receives a frame of 1024 samples in length that includes the heavy beat, a problem called pre-echo can occur because of inadequate temporal masking in a signal that includes abrupt changes. This is because the sound signal contains error after quantization, and this error can result in an audible noise before the onset of the heavy beat, hence called the pre-echo. Heavy beats are also called ‘attacks.’ A signal is said to have an attack if it exhibits a significant amount of non-stationarity within the duration of a frame under analysis. For example, a sudden increase in amplitudes of a time signal within a typical duration of analysis is an attack. To avoid this problem the audio signal is coded with frames having smaller frame lengths instead of the long 1024 samples. To keep continuity in the number of samples given as input, usually 8 smaller blocks of 128 samples are coded (8×128 samples=1024 samples). This will restrict the heavy beat to one set of 128 samples among 8 smaller blocks, and hence the noise introduced will not spread to the neighboring smaller blocks as pre-echo. But the disadvantage of coding in 8 smaller blocks of 128 samples is that they require more bits to code than required by the larger blocks of 1024 samples in length. So the compression efficiency of the audio coder is significantly reduced. To improve the compression efficiency, the heavy beats have to be detected accurately so that the smaller blocks can be applied only around the heavy beats. It is important that the heavy beats be accurately detected, or else pre-echo can occur. Also, a false detection of heavy beats can result in significantly reduced compression efficiency. Current methods to detect the heavy beats use the PE. Calculating the PE is computationally very intensive and also not very accurate.

Also, we have seen earlier that the blocks that have attacks should be coded as smaller blocks having 128 samples and others as larger blocks having 1024 samples. The smaller frame lengths of 128 samples are called ‘short-blocks’, and the 1024 samples frame length are called ‘long-blocks.’ We have also seen that the short-blocks require more bits to code than the long-blocks. Also for each large frame there is a fixed number of bits allocated. If we can intelligently save some bits while coding a long-block and use the saved bits in a short-block, the compression efficiency of the audio coder can be significantly increased. For storing the bits, a ‘Bit Reservoir mechanism’ is needed. Since long-blocks do not need a large number of bits, the unused bits from the long-blocks can be saved in the bit reservoir and used later for a short-block. Currently there are no efficient techniques to save and allocate bits between long and short-blocks to improve the compression efficiency of the audio coder.

The audio signal can be of two types (i) single channel or mono-signal and (ii) multi-channel or stereo signal to produce spatial effects. The stereo signal is a multi-channel signal comprised of two channels, namely left and right channels. Generally the audio signals in the two channels have a large correlation between them. By using this correlation the stereo channels can be coded more efficiently. Instead of directly coding the stereo channels, if their sum and difference signals are coded and transmitted where the correlation is high, a better quality of sound is achieved at a same bit rate. When the audio signal is a stereo signal, the audio coder can operate in two modes (a) normal mode and (b) M-S mode. The M-S mode means encoding the sum and difference of the left and right channels of the stereo. Currently the decision to switch between the normal and M-S modes is based on the PE. As explained before, computing PE is very computation intensive and inconsistent.
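The sum and difference coding described above can be illustrated with a short sketch. The function names and the factor of one half are assumptions made for the example (the text above only says that the sum and difference signals are coded); the sketch shows that the left and right channels can be recovered exactly from the two derived signals.

```python
def ms_encode(left, right):
    """Return (mid, side): the scaled sum and difference of the two channels.

    The 1/2 scaling is an assumption of this sketch, chosen so that decoding
    is a plain addition and subtraction.
    """
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


# Two highly correlated channels: the side signal is small compared to mid.
left = [1.0, 2.0, 3.0]
right = [1.0, 2.5, 2.0]
mid, side = ms_encode(left, right)
decoded_left, decoded_right = ms_decode(mid, side)
```

When the channels are strongly correlated, the side signal carries little energy, which is why coding it can cost fewer bits than coding the right channel directly.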

Therefore, there is a need in the art for a computationally efficient quantization technique. Also, there is a need in the art for an improved attack detection technique that is computationally less intensive and more accurate, to improve the compression efficiency of the audio coder. In addition, there is a need in the art for a technique to allocate the bits between the long and short-blocks to improve the compression efficiency of the audio coder. Furthermore, there is also a need in the art for a technique that is computationally efficient and more accurate in switching between the normal and the M-S modes when the audio signal is a stereo signal.

SUMMARY OF THE INVENTION

The present invention provides an improved technique for detecting an attack in an input audio signal to reduce pre-echo artifacts caused by attacks during compression encoding of the input audio signal. This is accomplished by providing a computationally efficient and more accurate attack detection technique. The improved audio coder converts the input audio signal to a digital audio signal. The audio coder then divides the digital audio signal into larger frames having a long-block frame length and partitions each of the frames into multiple short-blocks. The audio coder then computes short-block audio signal characteristics for each of the partitioned short-blocks based on changes in the input audio signal. The audio coder further compares the computed short-block characteristics to a set of threshold values to detect presence of an attack in each of the short-blocks and changes the long-block frame length of one or more short-blocks upon detecting the attack in the respective one or more short-blocks.

Further, the improved audio coder increases compression efficiency by efficiently allocating bits between long and short-blocks. The audio coder is also computationally efficient and more accurate in switching between the normal and M-S modes when the audio signal is a stereo signal. In addition, the present invention also describes a technique for reducing the computational complexity of quantization.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a prior-art perceptual audio coder.

FIG. 2 is a block diagram of a perceptual audio coder according to the teachings of the present invention.

FIG. 3 is a block diagram of one example embodiment of computing inter-block differences.

FIG. 4 is a block diagram of one embodiment of major components of the Quantizer shown in FIG. 2 and their interconnections.

FIG. 5 is a flowchart illustrating the overall operation of the embodiment shown in FIG. 2.

FIG. 6 is a flowchart illustrating the operation of the Bit Allocator shown in FIG. 4.

FIG. 7 is a flowchart illustrating the operation of the Quantizer shown in FIGS. 1 and 2 according to the teachings of the present invention.

FIG. 8 is a flowchart illustrating the overall operation of the embodiment shown in FIG. 2 when compression encoding a stereo audio signal according to the teachings of the present invention.

FIG. 9 shows an example of a suitable computing system environment for implementing embodiments of the present invention, such as those shown in FIGS. 1-8.

DETAILED DESCRIPTION

The present invention provides an improved audio coder by increasing the efficiency of the audio coder during compression of an input audio signal. This is accomplished by providing computationally efficient and more accurate attack detection and quantization techniques. Also, compression efficiency is improved by providing a technique to allocate bits between long and short-blocks. In addition, the present invention provides an audio coder that is computationally efficient and more accurate in switching between the normal and M-S modes when the audio signal is a stereo signal. The words ‘encode’ and ‘code’ are used interchangeably throughout this document to represent the same audio compression scheme. Also the words ‘encoder’ and ‘coder’ are used interchangeably throughout this document to represent the same audio compression system.

FIG. 1 shows a prior-art perceptual audio coder 100 including major components and their interconnections. Shown in FIG. 1 are Time frequency generator 110, Psychoacoustic model 120, Quantizer 130, and BitStream Formatter 140. The technique of compressing audio signals involves recording an audio signal through a microphone and then converting the recorded analog audio signal to a digital audio signal using an A/D converter. The digital audio signal is simply a series of numbers.

The Time frequency generator 110 receives the series of numbers in large frames (blocks) of fixed length 105. Generally, the fixed length of each frame is around 1024 samples (series of numbers). Time frequency generator 110 can only process one frame at a time. This means that the audio coder 100 can process only 1024 samples at a time. The Time frequency generator 110 then transforms the received fixed-length frames (1024 samples) into a corresponding frequency domain. The transformation to the frequency domain is accomplished by using an algorithm, and the output of this algorithm is another set of 1024 samples called a spectrum of the input. In the spectrum, each sample corresponds to a frequency. Then the Time frequency generator 110 computes masking thresholds from the spectrum. Masking thresholds are another set of numbers that are useful in compressing the audio signal. The following illustrates one example embodiment of computing masking thresholds.

The Time frequency generator 110 computes an energy spectrum by squaring the spectrum of 1024 samples. Then the samples are further divided into a series of bands. For example, the first 10 samples can be one band and the next 10 samples can be another subsequent band and so on. Note that the number of samples (width) in each band varies. The width of the bands is designed to best suit the properties of the human ear for listening to frequencies of sound. Then the computed energy spectrum values within each of the bands are summed separately to produce a grouped energy spectrum.

The Time frequency generator 110 then applies a spreading function to the grouped energy spectrum to obtain an excitation pattern. This operation involves simulating and applying the effects of sounds in one critical band to a subsequent (neighboring) critical band. Generally this step involves using a convolution algorithm between the spreading function and the energy spectrum.

Based on the tonal or noise-like nature of the spectrum in each critical band, a certain amount of frequency-dependent attenuation is applied to obtain initial masking threshold values. Using an absolute threshold of hearing, the final masked thresholds are obtained. Absolute threshold of hearing is a set of amplitude values below which the human ear will not be able to hear.

The Psychoacoustic model 120 combines the initial masking threshold values with the absolute threshold values to obtain the final masked threshold values. Masked threshold value means a sound value below which quantization noise is not audible to the human ear (it is an estimate of the maximum allowable noise that can be introduced during quantization).

Using the masked threshold values, the Psychoacoustic model 120 computes perceptual entropy (PE). The perceptual entropy is a measure of the minimum number of bits required to code a current frame of audio samples. In other words, the PE indicates how much the current frame of audio samples can be compressed. Various types of algorithms are currently used to compute the PE.

The Quantizer 130 then receives the spectrum, the computed masking threshold values, and the PE, and compresses the audio signals. The Quantizer 130 has only a specific number of bits allocated for each frame. It distributes these bits across the spectrum based on the masking threshold values. If the masking threshold value is high, then the audio signal is not important and hence can be represented using a smaller number of bits. Similarly, if the masking threshold is low, the audio signal is important and hence can only be represented using a higher number of bits. Also, the Quantizer 130 checks to make sure that the allocated number of bits for the audio signals is not exceeded. The Quantizer 130 generally applies a two-loop strategy to allocate and monitor the number of bits to the received spectrum. The loops are generally nested and are called Rate Control and Distortion Control loops. The Rate Control loop controls the global gain so that the number of bits used to code the spectrum does not exceed the allocated number of bits, and the Distortion Control loop does the distribution of the bits to the received spectrum. Quantization is a major part of the perceptual audio coder 100. The performance of the Quantizer 130 can be significantly improved by reducing the number of calculations performed in the control loops. The current quantization algorithms used in the Quantizer 130 are very computation intensive and hence result in slower operation.

BitStream Formatter 140 receives the compressed audio signal (coded bits) from the Quantizer 130 and converts it into a desired format/syntax (specified coding standard) such as ISO MPEG-2 AAC.

FIG. 2 is a block diagram of one embodiment of a perceptual audio coder 200 according to the teachings of the present invention. In addition to what is shown in FIG. 1, in this embodiment the perceptual audio coder 200 includes a transient detection module 210. The transient detection module is coupled to receive the input audio signal. Also, the transient detection module 210 is coupled to provide an input to the time frequency generator 110 and psychoacoustic model 120.

In operation, the transient detection module 210 receives the input audio signal 105 as a series of numbers in frames of fixed length and partitions each of the frames into multiple short-blocks. In some embodiments, the fixed length is a long-block frame length of 1024 samples of digital audio signal. The digital audio signal comprises a series of numbers. The long-block is used when there is no attack in the input audio signal. In some embodiments, the short-blocks have a frame length in the range of about 100 to 300 samples of digital audio signal.

The transient detection module 210 computes short-block audio signal characteristics for each of the partitioned short-blocks. In some embodiments, computing the short-block audio signal characteristics includes computing inter-block differences (xdiff(m) for an m-th short-block) and inter-block ratios, and further determining the maximum inter-block difference and ratio, respectively. In some embodiments, computing the inter-block differences includes summing a square of the differences between samples in adjacent short-blocks. Further, in some embodiments, the inter-block ratios are computed to better isolate (detect) the attacks. In this embodiment, the inter-block ratios are computed by dividing the adjacent computed inter-block differences as follows:

r[0]=xdiff[0]/pxdif

r[1]=xdiff[1]/xdiff[0]

r[2]=xdiff[2]/xdiff[1]

r[3]=xdiff[3]/xdiff[2]

r[4]=xdiff[4]/xdiff[3]

where ‘pxdif’ is xdiff_(p)[4] (which is xdiff[4] of the previous frame)
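The ratio computation listed above can be sketched as follows. The function name, the `eps` guard against division by zero, and the sample values are assumptions added for illustration; the specification itself does not say how a zero inter-block difference is handled.

```python
def inter_block_ratios(xdiff, pxdif, eps=1e-12):
    """Compute r[m] = xdiff[m] / xdiff[m - 1], with r[0] = xdiff[0] / pxdif.

    pxdif is the last inter-block difference of the previous frame.
    eps is a hypothetical floor that avoids dividing by zero in silence.
    """
    previous = [pxdif] + list(xdiff[:-1])
    return [cur / max(prev, eps) for cur, prev in zip(xdiff, previous)]


# Toy inter-block differences for one frame; the jump from 2.0 to 8.0
# yields the largest ratio and marks the likely position of an attack.
xdiff = [1.0, 2.0, 8.0, 4.0, 4.0]
ratios = inter_block_ratios(xdiff, pxdif=0.5)
max_ratio = max(ratios)
max_diff = max(xdiff)
```

The maximum ratio and maximum difference are the quantities that would then be compared against the set of threshold values.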

The transient detection module 210 compares the computed short-block characteristics with a set of threshold values to detect the presence of an attack in each of the short-blocks. Then the transient detection module 210 changes the long-block frame length of the frame including the attack based on the outcome of the comparison, and inputs the changed frame length to the time frequency generator 110 to reduce the effect of the pre-echo caused by the attack. In some embodiments, the time frequency generator uses short-blocks to restrict the attack to a smaller frame so that the attack does not spread to adjacent smaller frame lengths, to reduce the pre-echo artifact caused by the attack. In this embodiment, the smaller frames have a frame length in the range of about 100 to 200 samples of digital audio signal.

FIG. 3 illustrates an overview of one embodiment of computing inter-block differences to detect the presence of an attack in an input audio signal according to the teachings of the present invention. As explained earlier with reference to FIGS. 1 and 2, the input audio signal 305 is divided into large frames by a signal splitter 330 and processed by the perceptual audio coder 200 into frames. Each of the frames has a long-block frame length of 1024 samples of digital audio signal. The transient detection module 210 detects the presence of an attack by using two adjacent incoming frames at a time. In the example embodiment shown in FIG. 3 the transient detection module 210 receives two adjacent current and previous frames 310 and 320, respectively. Also shown are the partitioned short-blocks 315 and 325 corresponding to the frames 310 and 320, respectively. In the embodiment shown in FIG. 3, each of the short-blocks 315 and 325 corresponding to the frames 310 and 320, respectively, has a frame length of 256 samples. The last five short-blocks (the four short-blocks 315 from the frame 310 and one adjacent short-block 325 from the frame 320) are used in detecting the presence of an attack in the adjacent frame 320 before transformation to the frequency domain by the Time frequency generator 110.

The following computational sequence is used in detecting the presence of an attack in the adjacent frame 320:

The inter-block differences xdiff(m) 340 in the time domain are computed using the following algorithm:

${{xdiff}(m)} = {\frac{4}{N}{\sum\limits_{j = 0}^{{N/4} - 1}\lbrack {{s( {j,m} )} - {s( {j,{m - 1}} )}} \rbrack^{2}}}$

where s(j,m) is the j'th time domain sample of the m'th short-block and s(j,m−1) corresponds to time domain samples of the last short-block of the adjacent frame 320. The Diff blocks 350 shown in FIG. 3 compute the difference between two adjacent short-blocks 315 and 325. The ( )² blocks 360 in FIG. 3 compute the square of the respective computed differences. The Σ blocks 370 compute the sum, and finally the xdiff(m) is computed as indicated in the above algorithm.
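A minimal sketch of the inter-block difference computation follows. It assumes the short-blocks are supplied in time order, with the last short-block of the previous frame first, and that all short-blocks have the same length N/4; the function name and the toy four-sample blocks are illustrative assumptions only.

```python
def inter_block_differences(blocks):
    """Mean squared difference between each short-block and its predecessor.

    For short-blocks of length N/4, dividing the sum of squared differences
    by the block length is the same as multiplying it by 4/N.
    """
    diffs = []
    for prev, cur in zip(blocks, blocks[1:]):
        squared = [(c - p) ** 2 for c, p in zip(cur, prev)]  # Diff and ( )^2 stages
        diffs.append(sum(squared) / len(cur))                # sum stage and 4/N scaling
    return diffs


# Five toy short-blocks of four samples each (real blocks would hold 256).
blocks = [
    [0, 0, 0, 0],  # last short-block of the previous frame
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [3, 1, 1, 1],
]
xdiff = inter_block_differences(blocks)
```

Stationary stretches give differences near zero, while a sudden change between two neighboring short-blocks produces a large value.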

In some embodiments, the short-block frame lengths are tuned to the application in use. In these embodiments, distance between the large frames is computed to determine an optimum size for the short-block frame lengths. The following algorithm is used to compute the distance between the large frames:

$xdiff(m) = d(\hat{S}_{m}, \hat{S}_{m-1})$

where $\hat{S}_{m}$ and $\hat{S}_{m-1}$ 380 are the signal sub-vectors for the m-th and (m−1)-th short-blocks, and d(•) is a function that returns a distance measure between the two vectors.

FIG. 4 illustrates one embodiment of the major components of the Quantizer 130 and their interconnections as shown in FIG. 2 used in a bit allocation strategy according to the teachings of the present invention. Shown in FIG. 4 are Bit Allocator 410, Bit Reservoir 420, and Memory 425. The technique of bit allocation strategy according to the teachings of the present invention includes efficient distribution of bits to different portions of the audio signal. Bits required to code the current frame can be estimated from the perceptual entropy of that frame. Extensive experimentation suggests that the number of bits required to encode is considerably less for a larger frame length than for a smaller frame length. Also, it has been found that large frames generally require fewer bits to encode than the average number of bits. The amount of reduction below the average number of bits is a function of bit rate. Using this technique also results in large savings of bits during stationary portions of the audio signal. The technique of bit allocation strategy according to the teachings of the present invention is explained in detail in the following section.

The Quantizer 130 receives the large and small frames including the samples of digital audio signal from the time frequency generator 110. Further, the Quantizer 130 receives the computed perceptual entropy from the psychoacoustic model 120 shown in FIG. 2. The Bit Allocator 410 computes an average number of bits that can be allocated to each of the received large frames. In some embodiments, the Bit Allocator 410 determines the average number of bits by using the long-block frame length and sampling frequency of the input audio signal. Further, the Bit Allocator 410 computes a bit rate and a reduction factor based on the computed bit rate and the received perceptual entropy. In addition, the Bit Allocator 410 computes a reduced average number of bits that can be allocated for each of the large frames using the computed reduction factor. Further, the Bit Allocator 410 computes remaining bits by subtracting the computed reduced average number of bits from the computed average number of bits. The Bit Allocator 410 includes a Bit Reservoir 420 to receive the remaining bits. The Bit Allocator 410 allocates a reduced average number of bits to the current frame and stores the remaining bits in the Bit Reservoir 420 when the current frame is a large frame. Further, the Bit Allocator allocates the reduced number of bits along with the stored bits from the Bit Reservoir 420 when the current frame is a small frame to improve the bit allocation between the large and small frames, to enhance sound quality of the compressed audio signal. The Bit Allocator 410 repeats the above process of bit allocation for a next adjacent frame. In some embodiments, the allocation of bits to a small frame is based on the number of bits available in the Bit Reservoir 420, the bit rate, and a scaling applied to the denominator, which distributes the bits across a continuous sequence of frames that use finer time resolution. At the same time, the Bit Allocator 410 makes sure that the Bit Reservoir 420 is not depleted too much.
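The bit reservoir behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the allocation rule of the specification: the class name, the fixed `reduction_factor` (which the text derives from the bit rate and perceptual entropy), and the `drain_divisor` that limits how much of the reservoir one small frame may take are all hypothetical.

```python
class BitAllocator:
    """Toy allocator: large frames save bits, small frames spend saved bits."""

    def __init__(self, bit_rate, sampling_frequency, frame_length=1024,
                 reduction_factor=0.75, drain_divisor=4):
        # Average bits per frame: bit rate multiplied by the frame duration.
        self.average_bits = int(bit_rate * frame_length / sampling_frequency)
        # Assumption: the reduced average is a fixed fraction of the average.
        self.reduced_bits = int(self.average_bits * reduction_factor)
        # Assumption: a small frame may take 1/drain_divisor of the reservoir,
        # so a run of small frames does not deplete it at once.
        self.drain_divisor = drain_divisor
        self.reservoir = 0

    def allocate(self, is_small_frame):
        if not is_small_frame:
            # Large frame: use the reduced average and save the remainder.
            self.reservoir += self.average_bits - self.reduced_bits
            return self.reduced_bits
        # Small frame: reduced average plus a share of the saved bits.
        extra = self.reservoir // self.drain_divisor
        self.reservoir -= extra
        return self.reduced_bits + extra


allocator = BitAllocator(bit_rate=48000, sampling_frequency=48000)
large_frame_bits = [allocator.allocate(False) for _ in range(4)]
small_frame_bits = allocator.allocate(True)
```

Dividing the reservoir rather than emptying it is one simple way to spread saved bits across a continuous sequence of small frames; the actual scaling used by the Bit Allocator 410 is not given in this passage.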

FIG. 4 also illustrates one embodiment of major components and their interconnections in the Quantizer 130 shown in FIG. 2 used in reducing computational complexity in the Quantizer 130 according to the teachings of the present invention. Also shown in FIG. 4 are Rate Control Loop 430 (also generally referred to as “Inner Iteration Loop”), Comparator 427, and Distortion Control Loop 440 (also generally referred to as “Outer Iteration Loop”).

The Rate Control Loop 430 computes global gain, which is commonly referred to as “common_scalefac”, for a given set of spectral values with a pre-determined value for the maximum number of bits available for encoding the frame (referred to as “available_bits”). The Rate Control Loop arrives at a unique solution for the common_scalefac value for a given set of spectral data for a fixed value of available_bits, so any other variation of the Rate Control Loop must necessarily arrive at the same solution. Efficiency of the Rate Control Loop is increased by reducing the number of iterations required to compute the common_scalefac value. The technique of reducing the number of iterations required to compute the common_scalefac value according to the teachings of the present invention is discussed in detail in the following section.

The Quantizer 130 stores a start_common_scalefac value of a previous adjacent frame to use in quantization of a current frame. The Rate Control Loop 430 computes the common_scalefac value for the current frame using the stored start_common_scalefac value as a starting value during computation of iterations by the Rate Control Loop 430 to reduce the number of iterations required to compute the common_scalefac value of the current frame. Further, the Rate Control Loop 430 computes count_bits using the common_scalefac value of the current frame. The Comparator 427 coupled to the Rate Control Loop compares the computed count_bits with available_bits. The Rate Control Loop changes the computed common_scalefac value based on the outcome of the comparison. In some embodiments, the count_bits comprises bits required to encode a given set of spectral values for the current frame.
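The benefit of starting from the previous frame's value can be illustrated with a toy rate loop. Everything in this sketch is an assumption made for illustration: the quantizer step size `2 ** (common_scalefac / 4)`, the bit-counting rule, and the unit-step search are stand-ins and are not the quantizer or search of this specification. The sketch only shows that a starting value close to the solution reaches the same common_scalefac in fewer iterations.

```python
def count_bits(spectrum, common_scalefac):
    """Toy bit count: bits needed for each quantized magnitude (assumed model)."""
    step = 2.0 ** (common_scalefac / 4.0)
    return sum(int(abs(x) / step).bit_length() for x in spectrum)


def rate_control_loop(spectrum, available_bits, start_common_scalefac=0):
    """Find the smallest common_scalefac whose count_bits fits available_bits.

    Returns (common_scalefac, iterations). The search may begin at any
    starting value and ends at the same solution because count_bits does not
    increase as common_scalefac grows.
    """
    scalefac = start_common_scalefac
    iterations = 0
    # Too many bits: coarsen the quantizer.
    while count_bits(spectrum, scalefac) > available_bits:
        scalefac += 1
        iterations += 1
    # Bits to spare: refine the quantizer while the budget still holds.
    while scalefac > 0 and count_bits(spectrum, scalefac - 1) <= available_bits:
        scalefac -= 1
        iterations += 1
    return scalefac, iterations


previous_frame = [100.0, 50.0, 25.0, 12.0]
current_frame = [90.0, 55.0, 20.0, 14.0]
available_bits = 10

# Value stored from the previous frame, then reused as the starting value.
stored_scalefac, _ = rate_control_loop(previous_frame, available_bits)
cold_scalefac, cold_iterations = rate_control_loop(current_frame, available_bits)
warm_scalefac, warm_iterations = rate_control_loop(
    current_frame, available_bits, start_common_scalefac=stored_scalefac)
```

Because adjacent frames of a stationary signal tend to need similar gains, the stored value usually lies close to the solution for the current frame.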

The Distortion Control Loop 440 is coupled to the Rate Control Loop 430 to distribute the bits among the samples in the spectrum based on the masking thresholds received from the psychoacoustic model. Also, the Distortion Control Loop 440 tries to allocate bits in such a way that quantization noise is below the masking thresholds. The Distortion Control Loop 440 also sets the starting value of start_common_scalefac to be used in the Rate Control Loop 430.

FIG. 5 illustrates one example embodiment of a process 500 of detecting an attack in an input audio signal to reduce a pre-echo artifact caused by the attack during a compression encoding of the input audio signal. The process 500 begins with step 510 by receiving an input audio signal and converting the received input audio signal into a digital audio signal. In some embodiments, the attack comprises a sudden increase in signal amplitude.

Step 520 includes dividing the converted digital audio signal into large frames having a long-block frame length. In some embodiments, the long-block frame length comprises 1024 samples of digital audio signal. In this embodiment, the samples of digital audio signal comprise a series of numbers. In this embodiment, the long-block frame length comprises a frame length used when there is no attack in the input audio signal.

Step 530 includes partitioning each of the large frames into multiple short-blocks. In some embodiments, partitioning large frames into short-blocks includes partitioning into short-blocks having short-block frame lengths in the range of about 100 to 300 samples.

Step 540 includes computing short-block characteristics for each of the partitioned short-blocks based on changes in the input audio signal. In some embodiments, the computing of the short-block characteristics includes computing inter-block differences and determining a maximum inter-block difference from the computed inter-block differences. In some embodiments, the computing of short-block characteristics further includes computing inter-block ratios and determining a maximum inter-block ratio from the computed inter-block ratios. In this embodiment, the computing of inter-block differences includes summing the squares of the differences between samples in adjacent short-blocks. Also in this embodiment, the computing of the inter-block ratios includes dividing adjacent computed inter-block differences. The process of computing the short-block characteristics is discussed in more detail with reference to FIG. 3.

Step 550 includes comparing the computed short-block characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks. Step 560 includes changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack. In some embodiments, the changing of the long-block frame length means dividing the large frame into multiple smaller frames, so that the attack is restricted to one or more smaller frames and the pre-echo artifact caused by the attack does not spread to the adjacent larger frames. In some embodiments, the smaller frame lengths include about 100 to 200 samples of digital audio signal.
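Steps 530 to 550 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the described embodiment: the function names, the short-block length of 128 samples, the two threshold values, the `eps` guard against division by zero, and the rule that both the maximum difference and the maximum ratio must exceed their thresholds are all hypothetical choices made for the example.

```python
def split_into_short_blocks(frame, short_len):
    """Partition a frame into consecutive short-blocks of short_len samples."""
    return [frame[i:i + short_len]
            for i in range(0, len(frame) - short_len + 1, short_len)]


def inter_block_differences(blocks):
    """Sum of squared sample differences between each pair of adjacent blocks."""
    return [sum((cur - prev) ** 2 for prev, cur in zip(a, b))
            for a, b in zip(blocks, blocks[1:])]


def detect_attack(frame, short_len=128, diff_threshold=1.0,
                  ratio_threshold=10.0, eps=1e-9):
    """Return True when the frame appears to contain an attack.

    The thresholds and the AND combination are assumptions for illustration.
    """
    blocks = split_into_short_blocks(frame, short_len)
    diffs = inter_block_differences(blocks)
    # Ratio of each inter-block difference to the preceding one;
    # eps (an assumption) avoids dividing by zero on silent passages.
    ratios = [cur / max(prev, eps) for prev, cur in zip(diffs, diffs[1:])]
    max_diff = max(diffs, default=0.0)
    max_ratio = max(ratios, default=0.0)
    return max_diff > diff_threshold and max_ratio > ratio_threshold


steady_frame = [0.01] * 1024                 # no change between short-blocks
attack_frame = [0.01] * 768 + [1.0] * 256    # sudden increase in amplitude
steady_has_attack = detect_attack(steady_frame)
attack_has_attack = detect_attack(attack_frame)
```

A frame flagged by `detect_attack` would then be coded with the smaller frame lengths described above, while an unflagged frame keeps the long-block frame length.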

FIG. 6 illustrates one example embodiment of an operation 600 of an efficient strategy for bit allocation to the large and small frames by the Bit Allocator shown in FIG. 4 according to the present invention. The operation 600 begins with step 610 by computing an average number of bits that can be allocated for each of the large frames. In some embodiments, the average number of bits is computed from the long-block frame length, the sampling frequency of the input audio signal, and the bit rate used for coding the input audio signal.

Step 620 includes computing a perceptual entropy for the current frame of audio samples using the masking thresholds computed as described in detail with reference to FIG. 1. Step 630 includes computing a bit rate using a sampling frequency and the current frame length. Step 640 includes computing a reduction factor based on the computed bit rate and the perceptual entropy. Step 650 includes computing a reduced average number of bits that can be allocated to each of the large frames using the computed reduction factor. Step 660 includes computing remaining bits by subtracting the computed reduced average number of bits from the computed average number of bits. Step 670 includes allocating bits based on whether the current frame is large or small. In some embodiments, if the current frame to be coded is large, then the reduced number of bits is allocated to the current frame and the remaining bits are stored in a Bit Reservoir, and if the current frame to be coded is small, then the reduced number of bits is allocated along with the stored bits from the Bit Reservoir. In some embodiments, the above-described operation 600 repeats itself for a next frame adjacent to the current frame.

The following example further illustrates the above-described operation 600 of the bit allocation strategy:

For example, if a given mono (single) audio signal at a bit rate of 64 kbps is sampled at a sampling frequency of 44100 Hz (meaning there are 44100 samples per second, which need to be encoded at a bit rate of 64000 bits per second) and the long-block frame length is 1024 samples, the average number of bits is computed as follows:

$\text{Average number of bits} = \frac{64000 \times 1024}{44100} = 1486.08 \approx 1486$

Therefore each frame is nominally coded using 1486 bits. Not every frame requires the same number of bits, and not every frame requires all of its bits. Assuming the first frame to be coded requires 1400 bits, the remaining unused 86 bits are stored in the Bit Reservoir and can be used in succeeding frames. For the next adjacent frame, a total of 1572 bits (1486 bits + 86 bits in the Bit Reservoir) is available for coding. For example, if the next adjacent frame is a short frame, more bits can be allocated for coding.
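The arithmetic of this example can be checked with a few lines of Python. The function name is hypothetical, and rounding down to a whole number of bits is an assumption; the text only shows 1486.08 being approximated as 1486.

```python
def average_bits_per_frame(bit_rate, frame_length, sampling_frequency):
    """Average bits available per frame, rounded down (rounding is an assumption)."""
    return (bit_rate * frame_length) // sampling_frequency


average = average_bits_per_frame(64000, 1024, 44100)
bits_used_by_first_frame = 1400              # figure taken from the example
unused = average - bits_used_by_first_frame  # goes to the Bit Reservoir
available_for_next_frame = average + unused
```

With the figures from the example, `average` is 1486, `unused` is 86, and `available_for_next_frame` is 1572.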

In some embodiments, fewer than the average number of bits are used for encoding the large frames (using a reduction factor) and the remaining bits are stored in the Bit Reservoir. For example, in the above case only 1300 bits are allocated for each of the large frames. Then the remaining 186 bits (reduction factor) are stored in the Bit Reservoir.

Generally the Bit Reservoir cannot be used to store a large number of remaining bits. Therefore, a maximum limit is set for the number of bits that can be stored in the Bit Reservoir, and anytime the number of bits exceeds the maximum limit, the excess bits are allocated to the next frame. In the above example, if the Bit Reservoir has exceeded the maximum limit, then the next frame will receive 1300 bits along with the number of bits by which the Bit Reservoir has exceeded the limit.

In the above-described operation 600 when the next frame is a small frame (small frames generally occur rarely), then more bits are allocated to the small frame from the Bit Reservoir. The number of extra bits that can be allocated to the small frame is dependent on two factors. One is the number of bits present in the Bit Reservoir and the other is the number of consecutive small blocks present in the input audio signal. Basically the strategy described in the above operation 600 is to remove bits from the long frames and to allocate the removed bits to the small frames as needed.
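The reservoir behaviour described above can be sketched as a small Python class. This is an illustrative sketch only: the class and attribute names are hypothetical, the reduced allocation of 1300 bits is passed in directly rather than derived from the perceptual entropy, the reservoir limit of 400 bits is an arbitrary value, and a small frame is assumed to drain the whole reservoir (the text makes the amount depend on the reservoir level and on the number of consecutive small blocks, without giving a formula).

```python
class BitAllocator:
    """Toy Bit Reservoir: long frames save bits, small frames spend them."""

    def __init__(self, average_bits, reduced_bits, reservoir_limit):
        self.average_bits = average_bits        # e.g. 1486
        self.reduced_bits = reduced_bits        # e.g. 1300
        self.reservoir_limit = reservoir_limit  # hypothetical maximum limit
        self.reservoir = 0                      # bits currently stored
        self.overflow = 0                       # excess owed to the next frame

    def allocate(self, is_small_frame):
        """Return the number of bits granted to the current frame."""
        granted = self.reduced_bits + self.overflow
        self.overflow = 0
        if is_small_frame:
            # Assumption: a small frame receives everything in the reservoir.
            granted += self.reservoir
            self.reservoir = 0
        else:
            # A large frame stores the bits removed by the reduction.
            self.reservoir += self.average_bits - self.reduced_bits
            if self.reservoir > self.reservoir_limit:
                # Excess over the limit is handed to the next frame.
                self.overflow = self.reservoir - self.reservoir_limit
                self.reservoir = self.reservoir_limit
        return granted


allocator = BitAllocator(average_bits=1486, reduced_bits=1300, reservoir_limit=400)
first_three_large = [allocator.allocate(False) for _ in range(3)]
fourth_large = allocator.allocate(False)   # also receives the overflow
small_frame = allocator.allocate(True)     # also receives overflow and reservoir
```

In this run each of the first three large frames receives 1300 bits, the fourth receives 1458 bits because the reservoir overflowed by 158 bits, and the small frame receives 1886 bits.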

FIG. 7 illustrates one example embodiment of operation 700 of reducing computational iterations during compression by a perceptual encoder to improve the operational efficiency of the perceptual audio coder. The operation 700 begins with step 710 by initializing common_scalefac for the current frame. In some embodiments, the common_scalefac is initialized using a common_scalefac value of a previous frame adjacent to the current frame. In some embodiments, this is the common_scalefac value obtained during the first call of the Rate Control Loop in the previous frame of the corresponding channel and is denoted as predicted_common_scalefac. In some embodiments, the initial value of the common_scalefac is set to start_common_scalefac+1 when the predicted_common_scalefac value is not greater than the common_scalefac value. In some embodiments, the common_scalefac includes a global gain for a given set of spectral values within the frame. The minimum value of common_scalefac or the global gain is referred to as the start_common_scalefac value. The value of quantizer_change, which is the step-size for changing the value of common_scalefac in the iterative algorithm, is set to 1.

At 720 counted_bits associated with the current frame are computed. In some embodiments, computing counted_bits includes quantizing the spectrum of the current frame and then computing the number of bits required to encode the quantized spectrum of the current frame.

At 730 a difference between the computed counted_bits and available_bits is computed. In some embodiments, the available_bits are the number of bits made available to encode the spectrum of the current frame. In some embodiments, the difference between the computed counted_bits and the available_bits is computed by comparing the computed counted_bits with the available_bits.

At 740 the computed difference is compared with a pre-determined MAXDIFF value. Generally, the value of pre-determined MAXDIFF is set to be in the range of about 300-500.

At 750 the common_scalefac value and quantizer_change value are reset based on the outcome of the comparison. In some embodiments, the common_scalefac value is reset when the computed difference is greater than the pre-determined MAXDIFF, and the common_scalefac value is changed based on the outcome of the comparison when the computed difference is less than or equal to the pre-determined MAXDIFF value.

In some embodiments, the changing of the common_scalefac value based on the outcome of the comparison further includes storing the computed counted_bits along with the associated common_scalefac value, then comparing the counted_bits with the available_bits, and finally changing the common_scalefac value based on the outcome of the comparison.

In some embodiments, changing the common_scalefac value based on the outcome of the comparison further includes assigning a value to a quantizer_change, and changing the common_scalefac value using the assigned value to the quantizer_change and repeating the above steps when the counted_bits is greater than the available_bits. Some embodiments include restoring the counted_bits and outputting the common_scalefac value when the counted_bits is less than or equal to available_bits.

In some embodiments, resetting the common_scalefac value further includes computing a predicted_common_scalefac value based on the stored common_scalefac value of the previous frame adjacent to the current frame, and resetting the common_scalefac value. In case counted_bits is greater than available_bits, common_scalefac is set to the start_common_scalefac value+64 when the start_common_scalefac value+64 is not greater than the predicted_common_scalefac value; otherwise common_scalefac is set to predicted_common_scalefac and quantizer_change is set to 64. Some embodiments include setting common_scalefac to start_common_scalefac+32, and further setting quantizer_change to 32, when the counted_bits is less than or equal to available_bits and the common_scalefac is not greater than start_common_scalefac+32, and, if predicted_common_scalefac is greater than the present common_scalefac, recomputing counted_bits. Further, some embodiments include setting common_scalefac to start_common_scalefac+64 when the counted_bits is less than or equal to available_bits and the common_scalefac value is greater than start_common_scalefac+32, and, if predicted_common_scalefac is greater than the present common_scalefac, recomputing counted_bits.
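The benefit of seeding the iteration with a value from the previous frame can be illustrated with a deliberately simplified Python sketch. It covers only the basic loop (raise common_scalefac by quantizer_change until counted_bits fits within available_bits) and does not reproduce the MAXDIFF comparison or the reset steps with step sizes of 32 and 64. The function names and the linear `toy_count_bits` model are hypothetical stand-ins for quantizing and bit-counting a real spectrum.

```python
def rate_control_loop(count_bits, available_bits, start_value, quantizer_change=1):
    """Increase common_scalefac until the frame fits in available_bits.

    count_bits is a caller-supplied function (hypothetical here) that returns
    the bits needed to encode the spectrum at a given common_scalefac.
    Returns (common_scalefac, counted_bits, iterations).
    """
    common_scalefac = start_value
    counted_bits = count_bits(common_scalefac)
    iterations = 1
    while counted_bits > available_bits:
        common_scalefac += quantizer_change
        counted_bits = count_bits(common_scalefac)
        iterations += 1
    return common_scalefac, counted_bits, iterations


def toy_count_bits(common_scalefac):
    """Toy model: a coarser quantizer (larger scalefac) needs fewer bits."""
    return max(0, 4000 - 20 * common_scalefac)


# Starting from scratch versus starting near the previous frame's result.
cold_start = rate_control_loop(toy_count_bits, 1486, start_value=0)
warm_start = rate_control_loop(toy_count_bits, 1486, start_value=120)
```

Both runs end at the same common_scalefac, but the warm start needs 7 evaluations of the bit count instead of 127. A starting value that is already too large would make this simplified loop stop immediately at a coarser quantizer than necessary, which is the kind of case the reset logic described above addresses.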

FIG. 8 illustrates one example embodiment of operation 800 of stereo coding to improve sound quality according to the present invention. The operation 800 begins with step 810 by converting left and right audio signals into left and right digital audio signals, respectively. Step 820 divides each of the converted left and right digital audio signals into frames having a long-block frame length. In some embodiments, the long-block frame length includes 1024 samples of digital audio signal.

Step 830 includes partitioning each of the frames into corresponding multiple left and right short-blocks having a short-block frame length. In some embodiments, the short-block frame length includes about 100 to 300 samples of digital audio signal.

Step 840 includes computing left and right short-block characteristics for each of the partitioned left and right short-blocks. In some embodiments, computing the short-block characteristics includes computing the sum and difference short-block characteristics by summing and subtracting respective samples of the digital audio signals in the left and right short-blocks. In some embodiments, computing the sum and difference short-block characteristics further includes computing sum and difference energies in each of the short-blocks in the left and right short-blocks by squaring each of the samples and adding the squared samples in each of the left and right short-blocks. In addition, the short-block energy ratio is computed for each of the short-blocks from the computed sum and difference energies, and the number of short-blocks whose computed short-block energy ratio exceeds a pre-determined energy ratio value is determined.

Step 850 includes encoding the stereo audio signal based on the computed short-block characteristics. In some embodiments, the encoding of the stereo signal includes using a sum and difference compression encoding technique to encode the left and right audio signals based on the determined number of short-blocks exceeding the pre-determined energy ratio value. In some embodiments, the pre-determined energy ratio value is greater than 0.75 or less than 0.25.
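Steps 830 to 850 can be sketched in Python as follows. Several details are assumptions made only for this illustration and are not stated in the text: the energy ratio is taken as the sum energy divided by the total of the sum and difference energies, a short-block counts when that ratio is above 0.75 or below 0.25, and sum and difference coding is selected when at least `min_blocks` short-blocks count. All function and parameter names are hypothetical.

```python
def count_ms_blocks(left, right, short_len=128, high=0.75, low=0.25):
    """Count short-blocks whose sum/difference energy ratio is outside [low, high]."""
    count = 0
    usable = min(len(left), len(right))
    for start in range(0, usable - short_len + 1, short_len):
        l_block = left[start:start + short_len]
        r_block = right[start:start + short_len]
        # Energies of the sum and difference signals of this short-block.
        sum_energy = sum((a + b) ** 2 for a, b in zip(l_block, r_block))
        diff_energy = sum((a - b) ** 2 for a, b in zip(l_block, r_block))
        total = sum_energy + diff_energy
        if total == 0:
            continue  # silent block: no ratio can be formed
        ratio = sum_energy / total  # assumed definition of the energy ratio
        if ratio > high or ratio < low:
            count += 1
    return count


def use_sum_difference_coding(left, right, min_blocks=4, short_len=128):
    """Decide on sum and difference coding; min_blocks is an assumed cutoff."""
    return count_ms_blocks(left, right, short_len) >= min_blocks


identical_channels = count_ms_blocks([0.5] * 1024, [0.5] * 1024)
one_silent_channel = count_ms_blocks([0.5] * 1024, [0.0] * 1024)
choose_ms = use_sum_difference_coding([0.5] * 1024, [0.5] * 1024)
```

Identical channels put all of the energy into the sum signal, so all 8 short-blocks count, whereas a signal present in only one channel splits its energy evenly between sum and difference and no short-block counts.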

FIG. 9 shows an example of a suitable computing system environment 900 for implementing embodiments of the present invention, such as those shown in FIGS. 1-8. Various aspects of the present invention are implemented in software, which may be run in the environment shown in FIG. 9 or any other suitable computing environment. The present invention is operable in a number of other general purpose or special purpose computing environments. Some computing environments are personal computers, server computers, hand held devices, laptop devices, multiprocessors, microprocessors, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments, and the like. The present invention may be implemented in part or in whole as computer-executable instructions, such as program modules that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures and the like to perform particular tasks or implement particular abstract data types. In a distributed computing environment, program modules may be located in local or remote storage devices.

FIG. 9 shows a general computing device in the form of a computer 910, which may include a processing unit 902, memory 904, removable storage 912, and non-volatile memory 908. Computer 910 may include, or have access to a computing environment that includes, a variety of computer-readable media, such as volatile memory 906 and non-volatile memory 908, and removable and non-removable storages 912 and 914, respectively. Computer storage includes RAM, ROM, EPROM and EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions. Computer 910 may include, or have access to a computing environment that includes, input 916, output 918, and a communication connection 920. The computer 910 may operate in a networked environment using a communication connection 920 to connect to one or more remote computers. The remote computer may include a personal computer, server, router, network PC, a peer device or other common network node, or the like. The communication connection 920 may include a local area network (LAN), a wide area network (WAN) or other networks.

CONCLUSION

The above-described invention increases compression efficiency by providing a technique to allocate bits between long and short-blocks. Also, the present invention significantly enhances the sound quality of the encoded audio signal by more accurately detecting an attack and reducing pre-echo artifacts caused by attacks. In addition, the present invention provides an audio coder that is computationally efficient and more accurate in switching between the normal and the M-S modes when the audio signal is a stereo signal.

The above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those skilled in the art. The scope of the invention should therefore be determined by the appended claims, along with the full scope of equivalents to which such claims are entitled.

1. (canceled)
2. A method for processing an audio signal, comprising: converting the audio signal into a digital audio signal; dividing the digital audio signal into large frames having a long-block frame length; partitioning each of the large frames into multiple short-blocks; computing short-block audio signal characteristics for each of the short-blocks based on changes in the input audio signal; comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks; and changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack.
3. The method of claim 2, wherein detecting the attack comprises: detecting a sudden increase in amplitude within the long-block frame length.
4. The method of claim 2, wherein the long-block frame length comprises 1024 samples of digital audio signal.
5. The method of claim 4, wherein the samples of digital audio signal comprise series of numbers.
6. The method of claim 5, wherein the long-block frame length comprises a frame length used when there is no attack in the input audio signal.
7. The method of claim 5, wherein the short-blocks comprise: short-blocks having short-block frame lengths in the range of about 100 to 300 samples.
8. The method of claim 5, wherein computing the short-block audio signal characteristics further comprises: computing inter-block differences; and determining a maximum inter-block difference from the computed inter-block differences.
9. An apparatus to detect an attack in an input digital audio signal to reduce a pre-echo artifact caused by the attack during compression encoding of the input digital audio signal, comprising: a time frequency generator to receive the digital audio signal and divide the digital audio signal into large frames having a long-block frame length, and to further partition each of the large frames into multiple short-blocks; and a transient detection module coupled to the time frequency generator to receive the multiple short-blocks and compute short-block audio signal characteristics for each of the received multiple short-blocks based on changes in the input digital audio signal, wherein the transient detection module compares the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the multiple short-blocks, and the transient detection module further changes the long-block frame length of one or more large frames including the attack based on the outcome of the comparison, wherein the time frequency generator receives the changed one or more large frames and compresses the changed one or more large frames to reduce the pre-echo artifact caused by the attack.
10. The apparatus of claim 9, wherein the attack comprises: a sudden increase in amplitude within the long-block frame length of the large frame of digital audio signal.
11. The apparatus of claim 10, wherein the long-block frame length comprises 1024 samples of digital audio signal.
12. A computer readable storage device comprising instructions that when executed by a processor execute a process for processing an audio signal by: converting the audio signal into a digital audio signal; dividing the digital audio signal into large frames having a long-block frame length; partitioning each of the large frames into multiple short-blocks; computing short-block audio signal characteristics for each of the short-blocks based on changes in the input audio signal; comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the short-blocks; and changing the long-block frame length of one or more large frames based on the outcome of the comparison to reduce the pre-echo artifact caused by the attack.
13. The computer readable storage device of claim 12, wherein detecting the attack comprises: detecting a sudden increase in amplitude within the long-block frame length.
14. The computer readable storage device of claim 12, wherein the long-block frame length comprises 1024 samples of digital audio signal.
15. The computer readable storage device of claim 14, wherein the samples of digital audio signal comprise series of numbers.
16. The computer readable storage device of claim 15, wherein the long-block frame length comprises a frame length used when there is no attack in the input audio signal.
17. The computer readable storage device of claim 15, wherein the short-blocks comprise: short-blocks having short-block frame lengths in the range of about 100 to 300 samples.
18. The computer readable storage device of claim 15, wherein computing the short-block audio signal characteristics further comprises: computing inter-block differences; and determining a maximum inter-block difference from the computed inter-block differences.
19. A method to detect an attack in an input digital audio signal to reduce a pre-echo artifact caused by the attack during compression encoding of the input digital audio signal, comprising: receiving the digital audio signal and dividing the digital audio signal into large frames having a long-block frame length, and further partitioning each of the large frames into multiple short-blocks; receiving the multiple short-blocks and computing short-block audio signal characteristics for each of the received multiple short-blocks based on changes in the input digital audio signal; comparing the computed short-block audio signal characteristics to a set of threshold values to detect a presence of the attack in each of the multiple short-blocks; changing the long-block frame length of one or more large frames including the attack based on the outcome of the comparison; and receiving the changed one or more large frames and compressing the changed one or more large frames to reduce the pre-echo artifact caused by the attack.
20. The method of claim 19, wherein the attack comprises: a sudden increase in amplitude within the long-block frame length of the large frame of digital audio signal.
21. The method of claim 20, wherein the long-block frame length comprises 1024 samples of digital audio signal.